08. Source: Web Scraping

Introduction

Source: Scraping Webpages

The two main ways to work with HTML files are:

  • Saving the HTML file to your computer (using the Requests library, for example) and reading that file into a BeautifulSoup constructor
  • Reading the HTML response content directly into a BeautifulSoup constructor (again using the Requests library for example)
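The two approaches above can be sketched roughly as follows. This is a minimal illustration, not the lesson's own code: the function names (save_html, soup_from_file, soup_from_url) are hypothetical, the choice of Python's built-in "html.parser" is an assumption, and any URL you pass in is up to you.

```python
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def save_html(url, path):
    """Approach 1, step 1: download a page and save its raw HTML to disk."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    Path(path).write_bytes(response.content)


def soup_from_file(path):
    """Approach 1, step 2: read a saved HTML file into BeautifulSoup."""
    # Opening in binary mode lets BeautifulSoup detect the encoding itself.
    with open(path, "rb") as f:
        return BeautifulSoup(f, "html.parser")


def soup_from_url(url):
    """Approach 2: pass the response content straight into BeautifulSoup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Either way you end up with a BeautifulSoup object. The first approach also leaves a copy of the page on disk, which is useful when the live page may change later.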

You'll learn how the Requests library works under the hood shortly in “Downloading Files from The Internet.”

For this lesson, you’re going to do neither of these. I've downloaded all of the Rotten Tomatoes HTML files for you and put them in a folder called rt_html in the Jupyter Notebooks in the Udacity classroom. If you want to work outside of the classroom, download this zip file and extract the rt_html folder. I recommend that you do so, and that you open the HTML files in your preferred text editor (e.g. Sublime, which is free) to inspect the HTML for the quizzes ahead.

The rt_html folder contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time as the list stood at the most recent update of this lesson. I'm giving you these historical files because the ratings will change over time, which would make them inconsistent with the recorded lesson videos. A web page's HTML also changes over time. Scraping code can break easily when a website is redesigned, which makes scraping brittle and not recommended for long-lived projects. So just use the HTML files provided to you and pretend that you saved them yourself with one of the methods described above.
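Working from the provided files might look something like the sketch below. It assumes the rt_html folder sits in your current working directory and that each page has a title tag. The helper name load_soups and the "html.parser" choice are illustrative, not part of the lesson.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_soups(folder):
    """Parse every file in `folder` and return {filename: BeautifulSoup}."""
    soups = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # skip any subfolders
        with open(path, "rb") as f:
            soups[path.name] = BeautifulSoup(f, "html.parser")
    return soups


# Example usage: print each file's <title>, if the folder is present.
folder = Path("rt_html")  # assumed location of the extracted folder
if folder.is_dir():
    for name, soup in load_soups(folder).items():
        title = soup.title.get_text(strip=True) if soup.title else "(no title)"
        print(name, "->", title)
```

Because the files are read from disk, this code keeps working even if the live Rotten Tomatoes pages are later redesigned.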

More Information